{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Lab 9 - Filtering\n", "\n", "Filtering refers to finding rows in the data that match a specific criteria, and is an important skill in data science. We are going to practice filtering in this lab, using a dataset of the top 1000 films in the IMDb database.\n", "\n", "Download the dataset here: [http://comet.lehman.cuny.edu/owen/teaching/mat128/imdb_1000.csv](http://comet.lehman.cuny.edu/owen/teaching/mat128/imdb_1000.csv)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, load the CSV file into a dataframe called `movies`, and check that it was loaded properly." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Which of the columns contain quantitative data? Which columns contain qualitative data?\n", "\n", "## Filtering qualitative data\n", "\n", "Suppose we wanted to know the median star rating of Crime films. First make a filter to find all Crime films. Notice that for qualitative data, we need quotes around the category we are looking for." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "crime_filter = movies[\"genre\"] == \"Crime\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can apply the filter to our `movies` dataframe using [ ]. What does `movies[crime_filter]` do? Try it below." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can make a new dataframe of only crime films by saving this filtered dataframe in a new variable:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using the new dataframe, find the median star rating." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
Pattern:\n", " dataframe_name[\"column_name\"].median()\n", "
\n", "\n", "## Filtering quantitative data\n", "\n", "When we are filtering quantitative data, we don't need quotes around the number, and we can use any of >, >=, ==, <=, < in the filter.\n", "\n", "For example, suppose we wanted to know how many movies are three hours or longer. First we make a filter:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "long_filter = movies[\"duration\"] >= 180" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "movies[long_filter]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To find out how many movies are three hours or longer, we could either count the number of rows above, or get the computer to do it for us. Can you remember how from Lab 7?" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
Answer:\n", " len(movies[long_filter])
\n", " or
\n", " movies[long_filter].shape[0]\n", "
\n", "\n", "## Applying two filters (and)\n", "\n", "Suppose we want to know the most popular genre of PG-13 movies that are shorter than 2 hours. We need to make a filter to select PG-13 movies, a filter to select movies shorter than 2 hours, and then we need to apply both filters at the same time.\n", "\n", "First, make a filter for PG-13 movies:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
Answer:\n", " pg13_filter = movies[\"content_rating\"] == \"PG-13\"\n", "
\n", "\n", "Next make a filter for movie shorter than 2 hours:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
Answer:\n", " short_filter = movies[\"duration\"] < 120\n", "
\n", "\n", "To apply both filters, we use the & (\"and\") symbol, since we want pg13_filter AND short_filter to be true:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "short_pg13_movies = movies[pg13_filter & short_filter]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Display the dataframe `short_pg13_movies`" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What's the most popular genre in this subset of movies?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Filtering for one or more conditions\n", "\n", "What if we have two conditions and want at least one of them to hold? For example, what if we really like dramas, but are also willing to watch any highly rated movie, and want to find the shortest such movie. In this case, we want to find any movie that is a drama or rated 8.8 or higher.\n", "\n", "First, create a filter for dramas:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
Answer:\n", " drama_filter = movies[\"genre\"] == \"Drama\"\n", "
\n", "\n", "Next create a film for movies with star rating of 8.8 or higher:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
Answer:\n", " top_filter = movies[\"star_rating\"] >= 8.8\n", "
\n", "\n", "We can find all movies meeting at least one of the filters using | (or), which is above the enter/return key:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "top_or_dramas = movies[drama_filter | top_filter]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Check the new dataframe by displaying it:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, find the shortest movie in our new dataframe. First let's get the row index of the movie with the shortest durction (see Lab 5 or the hint below):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
Hint:\n", " The pattern is: dataframe_name[\"column_name\"].idxmin()\n", "
\n", "\n", "To get the row in our top/drama movies dataframe, we should use `.loc[]` instead of `.iloc[]` as in Lab 5, since `.loc[]` looks for the row index not the row number, and we only have a subset of the original indexes (so the indexes no longer match the row numbers)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.3" } }, "nbformat": 4, "nbformat_minor": 2 }